78 ◾ Bioinformatics
mkdir indexdir
STAR --runThreadN 4 \
--runMode genomeGenerate \
--genomeDir indexdir \
--genomeFastaFiles ucscref/hg38.fa \
--sjdbGTFfile ucscref/hg38.ncbiRefSeq.gtf \
--sjdbOverhang 100
In the “STAR” command above, “--runThreadN” specifies the number of threads used
for indexing, “--runMode genomeGenerate” tells the program to generate a genome
index, “--genomeDir” specifies the directory where the index files are saved,
“--genomeFastaFiles” specifies the path of the reference genome FASTA file,
“--sjdbGTFfile” specifies the path of the annotation GTF file, and “--sjdbOverhang”
specifies the length of the genomic sequence around each annotated junction to be
used in constructing the splice junctions database. For this last option, the ideal
value is the read length minus one (n-1) if all reads have the same length; otherwise,
use the maximum read length minus one.
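When the read length is not known in advance, the value for “--sjdbOverhang” can be derived from the FASTQ file itself. The following is a minimal sketch, assuming a gzip-compressed, four-lines-per-record FASTQ file; the function name “sjdb_overhang” is hypothetical and is not part of STAR.

```shell
# Hypothetical helper (not part of STAR): print the maximum read length
# minus one for a gzip-compressed FASTQ file.
# Assumes the standard 4-line FASTQ layout, where the sequence is the
# 2nd line of every record.
sjdb_overhang() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }
        END         { print max - 1 }'
}

# Example call (path taken from the alignment command in this section):
# sjdb_overhang data/SRR769545_1.fastq.gz
```

The printed number can then be passed to “--sjdbOverhang” when building the index. For paired-end data, it is safer to run the helper on both files and use the larger value.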
Indexing may take a long time and may consume more memory and storage space than
other aligners require. Several files are generated, including binary genome sequence
files, suffix array files, text files with the chromosome names and lengths, splice
junction coordinates, and transcript/gene information. These files are for STAR’s
internal use; however, the chromosome names can be changed in the chromosome name
file (chrName.txt) if needed, as long as the order of the chromosomes is kept.
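A quick way to check that the index was built for the expected reference is to summarize the chromosome names and lengths file. The sketch below assumes the index directory contains “chrNameLength.txt” with one tab-separated name and length per line; the function name “index_summary” is hypothetical.

```shell
# Hypothetical helper: count the sequences in a STAR index and sum their
# lengths, reading <index_dir>/chrNameLength.txt (assumed format:
# name<TAB>length, one sequence per line).
index_summary() {
    awk -F '\t' '
        { n++; total += $2 }
        END { printf "%d sequences, %.0f bases\n", n, total }' \
        "$1/chrNameLength.txt"
}

# Example call:
# index_summary indexdir
```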
The next step is to align the reads with the STAR command. This time we use
“--runMode alignReads” to run the program in read-mapping mode, “--outSAMtype
BAM Unsorted” to generate an unsorted BAM file, “--readFilesCommand zcat” to tell the
program that the FASTQ files are gzip-compressed, “--genomeDir” to specify the index
directory, “--outFileNamePrefix” to specify the prefix for the output files, and
“--readFilesIn” to specify the FASTQ file names. You can also set “--outSAMtype BAM
SortedByCoordinate” to generate a BAM file sorted by alignment coordinates. However,
sorting requires additional memory and may exhaust the RAM of a computer with 32 GB.
STAR --runThreadN 4 \
--runMode alignReads \
--outSAMtype BAM Unsorted \
--readFilesCommand zcat \
--genomeDir indexdir \
--outFileNamePrefix STARoutput/SRR769545 \
--readFilesIn data/SRR769545_1.fastq.gz data/SRR769545_2.fastq.gz
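If a coordinate-sorted BAM file is needed but sorting inside STAR is too demanding on memory, one alternative is to sort the unsorted output afterwards with samtools, which lets the memory per thread be capped. This is a sketch under assumptions: samtools is installed, the thread count and the 1G limit are illustrative values, and the function name “sort_bam” is hypothetical.

```shell
# Hypothetical helper: coordinate-sort an unsorted STAR BAM file with
# samtools and index the result. Assumes samtools is on the PATH.
# -@ 4  : additional sorting threads (illustrative value)
# -m 1G : memory limit per thread (illustrative value)
sort_bam() {
    in_bam=$1
    # Mirror the naming STAR uses for sorted output:
    # <prefix>Aligned.out.bam -> <prefix>Aligned.sortedByCoord.out.bam
    out_bam=${in_bam%.out.bam}.sortedByCoord.out.bam
    samtools sort -@ 4 -m 1G -o "$out_bam" "$in_bam" &&
        samtools index "$out_bam"
}

# Example call, using the output prefix from the command above:
# sort_bam STARoutput/SRR769545Aligned.out.bam
```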
STAR alignment mode produces a BAM file containing the read alignment information
and four text files: three log files named “*Log.out”, “*Log.progress.out”, and
“*Log.final.out”, where “*” stands for the prefix specified by the “--outFileNamePrefix” option,